Check Your Chances To Get A Loan.¶
Hi, this is Afroz Samee. Thank you for your interest in reading this notebook. In this notebook, I walked you through the process I followed to predict whether a loan was granted successfully or not. I used the logistic regression machine learning algorithm and the Python libraries Pandas and NumPy for data cleaning. For data visualization, I worked with Matplotlib and Seaborn, and for more interactive analysis, I used Plotly.
This may sound technical and new to some of you, but trust me, it isn’t. By the end of this notebook, you will have gained a clearer understanding of machine learning techniques.
How is this Notebook Different?¶
The main question is: how is my notebook different from others available online? Simply put, the uniqueness of this notebook lies in the fact that it not only explains what steps were taken to achieve the result but also why these steps were necessary. Additionally, I provided brief explanations for various functions and comparisons on which techniques are better suited for specific use cases. There are no doubt numerous tutorials on loan prediction, especially on Kaggle; however, many of these tutorials offer limited explanations and, at times, feature an overuse of graphs without much added value. For example, notebook1 and notebook2 are primarily filled with graphs without sufficient context.
While the prediction accuracy of this model may not be perfect, I believe this notebook serves as a helpful guide to creating and improving a machine learning model. It demonstrates the steps involved in modeling, offers a framework for building meaningful explanations, and encourages further exploration of algorithms to compare their accuracies and false-positive rates.
Let me briefly outline the steps I followed in this notebook:
- Importing Required Libraries
- Loading and Understanding Your Dataset
- Data Cleaning
- Exploring Data Types & Values of Columns
- Exploratory Data Analysis
- Preprocessing the Data
- Feature Engineering and Scaling Techniques
- Understanding the Correlation Between Columns
- Applying the Machine Learning Model
- Prediction Summary
- References
#extracting the zip file
from zipfile import ZipFile
file_name = 'playground-series-s4e10.zip'
with ZipFile(file_name, 'r') as zip_file:
    zip_file.extractall()
    print('completed')
completed
Importing and Setting Up the Required Libraries¶
Below are the Python libraries I used for data analysis, manipulation, and training:
Pandas: This library was used for data manipulation and analysis, especially when working with dataframes—2D tabular data with labeled rows and columns. It also provided functions for handling files, dealing with missing data, removing duplicates, grouping rows or columns, and applying aggregate functions.
NumPy: This library provided numerical computation capabilities, particularly for working with multi-dimensional arrays and applying mathematical functions.
Matplotlib: I used this library to create 2D visualizations such as line plots, bar charts, histograms, and scatter plots.
Seaborn: Built on top of Matplotlib, Seaborn was used to visualize statistical data. In this notebook, I worked with box plots, heatmaps, and bar plots, among others.
Plotly: Plotly offered interactive, 3D plotting and dynamic visualizations, allowing me to create animated charts, sliders, and fully interactive dashboards. It supports large datasets and comes with features like zooming, hovering, and more.
Scikit-learn (sklearn): Built on top of NumPy, SciPy, and Matplotlib, this library provided efficient tools for data analysis. I used it for logistic regression, data preprocessing techniques such as scaling and encoding, and model evaluation with metrics like accuracy score and cross-validation.
#Import and setup required libraries or packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings("ignore")
Loading and Exploring the Datasets¶
Data Files¶
Training Data: This dataset contained both the independent variables (features) and the dependent variable (target). The machine learning models were trained on this data, learning patterns that would be applied later to the testing data to predict the target variable.
Testing Data: This dataset consisted only of the independent variables, meaning it contained all the feature columns without the target variable.
Reference: Kaggle's DataSet Loan Approval Prediction (Playground Series - Season 4, Episode 10)
Knowing the Datasets¶
Before applying any machine learning algorithm, it was essential to follow some initial steps. The first and most critical step was understanding the dataset thoroughly. This involved knowing its shape, the data types of each feature, whether the target variable was numerical or categorical, and gaining a clear understanding of what you would be working on.
#The files are read and loaded into the dataframe using pandas which is denoted as 'pd' in my notebook
trainLoan_df = pd.read_csv('train.csv')
testLoan_df = pd.read_csv('test.csv')
#.head() will display first 5 rows of the dataframe
display(trainLoan_df.head())
display(testLoan_df.head())
| | id | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37 | 35000 | RENT | 0.0 | EDUCATION | B | 6000 | 11.49 | 0.17 | N | 14 | 0 |
| 1 | 1 | 22 | 56000 | OWN | 6.0 | MEDICAL | C | 4000 | 13.35 | 0.07 | N | 2 | 0 |
| 2 | 2 | 29 | 28800 | OWN | 8.0 | PERSONAL | A | 6000 | 8.90 | 0.21 | N | 10 | 0 |
| 3 | 3 | 30 | 70000 | RENT | 14.0 | VENTURE | B | 12000 | 11.11 | 0.17 | N | 5 | 0 |
| 4 | 4 | 22 | 60000 | RENT | 2.0 | MEDICAL | A | 6000 | 6.92 | 0.10 | N | 3 | 0 |
| | id | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58645 | 23 | 69000 | RENT | 3.0 | HOMEIMPROVEMENT | F | 25000 | 15.76 | 0.36 | N | 2 |
| 1 | 58646 | 26 | 96000 | MORTGAGE | 6.0 | PERSONAL | C | 10000 | 12.68 | 0.10 | Y | 4 |
| 2 | 58647 | 26 | 30000 | RENT | 5.0 | VENTURE | E | 4000 | 17.19 | 0.13 | Y | 2 |
| 3 | 58648 | 33 | 50000 | RENT | 4.0 | DEBTCONSOLIDATION | A | 7000 | 8.90 | 0.14 | N | 7 |
| 4 | 58649 | 26 | 102000 | MORTGAGE | 8.0 | HOMEIMPROVEMENT | D | 15000 | 16.32 | 0.15 | Y | 4 |
#.shape, gives the number of rows and columns in the dataframe
print("Train Dataset shape:",trainLoan_df.shape)
print("Test Dataset shape:",testLoan_df.shape)
Train Dataset shape: (58645, 13) Test Dataset shape: (39098, 12)
#.info() gives a concise summary of the dataframe
trainLoan_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 58645 entries, 0 to 58644 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 58645 non-null int64 1 person_age 58645 non-null int64 2 person_income 58645 non-null int64 3 person_home_ownership 58645 non-null object 4 person_emp_length 58645 non-null float64 5 loan_intent 58645 non-null object 6 loan_grade 58645 non-null object 7 loan_amnt 58645 non-null int64 8 loan_int_rate 58645 non-null float64 9 loan_percent_income 58645 non-null float64 10 cb_person_default_on_file 58645 non-null object 11 cb_person_cred_hist_length 58645 non-null int64 12 loan_status 58645 non-null int64 dtypes: float64(3), int64(6), object(4) memory usage: 5.8+ MB
trainLoan_df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 58645.0 | 29322.000000 | 16929.497605 | 0.00 | 14661.00 | 29322.00 | 43983.00 | 58644.00 |
| person_age | 58645.0 | 27.550857 | 6.033216 | 20.00 | 23.00 | 26.00 | 30.00 | 123.00 |
| person_income | 58645.0 | 64046.172871 | 37931.106979 | 4200.00 | 42000.00 | 58000.00 | 75600.00 | 1900000.00 |
| person_emp_length | 58645.0 | 4.701015 | 3.959784 | 0.00 | 2.00 | 4.00 | 7.00 | 123.00 |
| loan_amnt | 58645.0 | 9217.556518 | 5563.807384 | 500.00 | 5000.00 | 8000.00 | 12000.00 | 35000.00 |
| loan_int_rate | 58645.0 | 10.677874 | 3.034697 | 5.42 | 7.88 | 10.75 | 12.99 | 23.22 |
| loan_percent_income | 58645.0 | 0.159238 | 0.091692 | 0.00 | 0.09 | 0.14 | 0.21 | 0.83 |
| cb_person_cred_hist_length | 58645.0 | 5.813556 | 4.029196 | 2.00 | 3.00 | 4.00 | 8.00 | 30.00 |
| loan_status | 58645.0 | 0.142382 | 0.349445 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
Data Cleaning¶
Checking for Null Values: The first step was to check if there were any null values in the datasets. In my case, neither dataset had any null values to handle.
Handling Irrelevant Information: It was important to identify and remove any irrelevant or incorrect information early to prevent the model from overfitting. For example, in my dataset, one person’s age was listed as 123 years, which was clearly unrealistic, and since it was the only instance, I decided to drop that row. Similarly, another entry had an employment time that was greater than the person’s age, which was not possible.
print(f"Train data missing values if any: \n{trainLoan_df.isnull().sum()}")
Train data missing values if any: id 0 person_age 0 person_income 0 person_home_ownership 0 person_emp_length 0 loan_intent 0 loan_grade 0 loan_amnt 0 loan_int_rate 0 loan_percent_income 0 cb_person_default_on_file 0 cb_person_cred_hist_length 0 loan_status 0 dtype: int64
print(f"Test data missing values if any: \n{testLoan_df.isnull().sum()}")
Test data missing values if any: id 0 person_age 0 person_income 0 person_home_ownership 0 person_emp_length 0 loan_intent 0 loan_grade 0 loan_amnt 0 loan_int_rate 0 loan_percent_income 0 cb_person_default_on_file 0 cb_person_cred_hist_length 0 dtype: int64
#drop the row where the person's age is 123
print(f"Rows in dataframe having age equal to 123:\n{(trainLoan_df['person_age'] == 123).sum()}")
trainLoan_df = trainLoan_df[trainLoan_df['person_age'] != 123]
trainLoan_df.shape
Rows in dataframe having age equal to 123: 1
(58644, 13)
#drop the rows where person_emp_length equals 123
print(f"Rows in dataframe having person_emp_length equal to 123:\n{(trainLoan_df['person_emp_length'] == 123).sum()}")
trainLoan_df = trainLoan_df[trainLoan_df['person_emp_length'] != 123]
trainLoan_df.shape
Rows in dataframe having person_emp_length equal to 123: 2
(58642, 13)
#Keep only rows where 'person_emp_length' is less than 'person_age', eliminating rows with an employment length greater than the person's age.
print(f"Rows in dataframe having age of person less than employment period:\n{(trainLoan_df['person_emp_length'] > trainLoan_df['person_age']).sum()}")
condition = trainLoan_df['person_emp_length'] < trainLoan_df['person_age']
trainLoan_df = trainLoan_df.loc[condition]
trainLoan_df.shape
Rows in dataframe having age of person less than employment period: 0
(58642, 13)
#Keep only rows where 'cb_person_cred_hist_length' is less than 'person_age', eliminating rows with a credit history longer than the person's age.
print(f"Rows in dataframe having age of person less than credit history length:\n{(trainLoan_df['cb_person_cred_hist_length'] > trainLoan_df['person_age']).sum()}")
condition = trainLoan_df['cb_person_cred_hist_length'] < trainLoan_df['person_age']
trainLoan_df = trainLoan_df.loc[condition]
trainLoan_df.shape
Rows in dataframe having age of person less than credit history length: 1
(58641, 13)
Extracting Unique Values and Data Types¶
Extracting unique values and identifying data types early in the analysis offers several key advantages. This step allows us to identify categorical columns, which is essential for choosing the right encoding techniques. For instance, in this notebook, I apply Label Encoding and pd.get_dummies (similar to one-hot encoding) based on the unique values within each categorical column.
Understanding whether a categorical column is related to the target variable is also important. If a categorical column has numerous unique values but shows no significant relationship with the target, we might consider dimensionality reduction techniques to simplify the data. This process helps manage and balance the dataset more effectively, enabling the selection of encoding methods that preserve interpretability.
This approach also allows flexibility, enabling us to revisit and adjust encoding techniques as the analysis evolves.
non_numerical_columns = trainLoan_df.select_dtypes(include=['object']).columns.tolist()
for col in non_numerical_columns:
    print(f"Column: {col}")
    print(f"Unique Values: {trainLoan_df[col].unique()}")
    print("\n")
Column: person_home_ownership Unique Values: ['RENT' 'OWN' 'MORTGAGE' 'OTHER'] Column: loan_intent Unique Values: ['EDUCATION' 'MEDICAL' 'PERSONAL' 'VENTURE' 'DEBTCONSOLIDATION' 'HOMEIMPROVEMENT'] Column: loan_grade Unique Values: ['B' 'C' 'A' 'D' 'E' 'F' 'G'] Column: cb_person_default_on_file Unique Values: ['N' 'Y']
non_numerical_columns = testLoan_df.select_dtypes(include=['object']).columns.tolist()
for col in non_numerical_columns:
    print(f"Column: {col}")
    print(f"Unique Values: {testLoan_df[col].unique()}")
    print("\n")
Column: person_home_ownership Unique Values: ['RENT' 'MORTGAGE' 'OWN' 'OTHER'] Column: loan_intent Unique Values: ['HOMEIMPROVEMENT' 'PERSONAL' 'VENTURE' 'DEBTCONSOLIDATION' 'EDUCATION' 'MEDICAL'] Column: loan_grade Unique Values: ['F' 'C' 'E' 'A' 'D' 'B' 'G'] Column: cb_person_default_on_file Unique Values: ['N' 'Y']
Exploratory Data Analysis¶
#sns.countplot(x = 'loan_status', data = trainLoan_df)
def plot_countplot(data, column):
    sns.countplot(data=data, x=column, palette="Set2", hue=column)
    plt.xlabel('Loan_status')
    plt.ylabel('Count')
    plt.title('Representation of Approved Loans over Rejected')
    sns.despine()
    plt.show()

plot_countplot(trainLoan_df, 'loan_status')
def numericalData_histPlot(df, feature, target):
    fig = px.histogram(df, x=feature, color=target,
                       title=f'Distribution of {feature} by Loan Status',
                       labels={feature: feature, 'count': 'Count', target: target},
                       hover_data={feature: True, target: True})
    # Update the layout for better visuals
    fig.update_layout(
        xaxis_title=feature,
        yaxis_title='Count',
        bargap=0.1,  # Adjusts the gap between bars
        hovermode="x unified",  # Display hover data for both colors (stacked bars)
        showlegend=True,
        title_x=0.5, width=900, height=500
    )
    # Show the figure
    fig.show()
numericalData_histPlot(trainLoan_df, 'person_age', 'loan_status')
numericalData_histPlot(trainLoan_df, 'loan_amnt', 'loan_status')
def plot_categorical_data_interactive(df, feature1, target):
    """Creating a bar graph and a pie chart using plotly express and implementing sub graphs using the subplots module from plotly"""
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=(f'Countplot of {feature1} by {target}', f'Pie chart of {feature1}'),
                        specs=[[{"type": "xy"}, {"type": "domain"}]])
    # Plotting bar graph using plotly express
    bar = px.histogram(df, x=feature1, color=target, barmode='group',
                       labels={feature1: feature1, 'count': 'Count'},
                       color_discrete_sequence=px.colors.qualitative.Set1)
    for trace in bar['data']:
        fig.add_trace(trace, row=1, col=1)
    # Plotting pie chart using the graph objects module from plotly
    pie = df[feature1].value_counts()
    pie_fig = go.Pie(labels=pie.index, values=pie.values,
                     marker=dict(colors=px.colors.qualitative.Set1),
                     hole=0.3)
    fig.add_trace(pie_fig, row=1, col=2)
    # Complete the layout
    fig.update_layout(title_text=f'Comparison of {feature1} by {target}', showlegend=True,
                      title_x=0.5, width=1000, height=500)
    fig.show()
plot_categorical_data_interactive(trainLoan_df, 'person_home_ownership', 'loan_status')
plot_categorical_data_interactive(trainLoan_df, 'loan_intent', 'loan_status')
Data Preprocessing Techniques¶
Here, I’m using two primary data preprocessing methods:
Label Encoding¶
Label encoding assigns a unique numerical label to each unique value in a categorical attribute. For instance, in our dataset, the attribute cb_person_default_on_file has values N and Y, which are transformed to 0 and 1 respectively.
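As a quick illustration of that mapping (a toy sketch, not the notebook's preprocessing function), LabelEncoder learns the sorted categories and replaces each value with its index:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column mirroring cb_person_default_on_file
sample = pd.Series(['N', 'Y', 'N', 'Y', 'N'])
le = LabelEncoder()
encoded = le.fit_transform(sample)
print(list(le.classes_))  # categories in sorted order: ['N', 'Y']
print(encoded)            # [0 1 0 1 0] -> N maps to 0, Y maps to 1
```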
Get Dummies (One-Hot Encoding)¶
This method is useful when there is no inherent order or priority among categories. One-hot encoding creates a binary (0 or 1) column for each unique value in the feature, where 1 indicates the presence of that value and 0 indicates absence.
Note: If the feature has many unique values, this approach can lead to a large number of columns, which increases memory use and may slow down the model. The pd.get_dummies() method in Pandas is similar to OneHotEncoder in Scikit-learn.
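A minimal sketch of pd.get_dummies on a toy column that borrows its categories from person_home_ownership:

```python
import pandas as pd

# Toy column with categories borrowed from person_home_ownership
home = pd.Series(['RENT', 'OWN', 'MORTGAGE', 'RENT'], name='person_home_ownership')
dummies = pd.get_dummies(home)
print(dummies.columns.tolist())   # ['MORTGAGE', 'OWN', 'RENT'] -> one binary column per category
print(int(dummies['RENT'].sum())) # 2 -> two rows are RENT
```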
def preprocessing_trainingDataSet(df):
    '''This function performs preprocessing using label encoding and the `get_dummies` method from pandas,
    which is similar to one-hot encoding.'''
    le = LabelEncoder()
    df['cb_person_default_on_file'] = le.fit_transform(df['cb_person_default_on_file'])
    df['loan_grade'] = le.fit_transform(df['loan_grade'])
    loan_intent = pd.get_dummies(df['loan_intent'])
    person_home_ownership = pd.get_dummies(df['person_home_ownership'])
    df.drop(['loan_intent', 'person_home_ownership'], axis=1, inplace=True)
    print(f'Shape of the dataset after dropping object-type columns: {df.shape}')
    df_onehotEncoding = pd.concat([loan_intent, person_home_ownership], axis=1)
    print(f'Shape of the dataset after applying the get_dummies preprocessing technique: {df_onehotEncoding.shape}')
    df = pd.concat([df, df_onehotEncoding], axis=1)
    print(f'Shape of the dataset after adding the preprocessed columns: {df.shape}')
    return df
trainLoan_df = preprocessing_trainingDataSet(trainLoan_df)
display(trainLoan_df.head())
Shape of the dataset after dropping object-type columns: (58641, 11) Shape of the dataset after applying the get_dummies preprocessing technique: (58641, 10) Shape of the dataset after adding the preprocessed columns: (58641, 21)
| | id | person_age | person_income | person_emp_length | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length | ... | DEBTCONSOLIDATION | EDUCATION | HOMEIMPROVEMENT | MEDICAL | PERSONAL | VENTURE | MORTGAGE | OTHER | OWN | RENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37 | 35000 | 0.0 | 1 | 6000 | 11.49 | 0.17 | 0 | 14 | ... | False | True | False | False | False | False | False | False | False | True |
| 1 | 1 | 22 | 56000 | 6.0 | 2 | 4000 | 13.35 | 0.07 | 0 | 2 | ... | False | False | False | True | False | False | False | False | True | False |
| 2 | 2 | 29 | 28800 | 8.0 | 0 | 6000 | 8.90 | 0.21 | 0 | 10 | ... | False | False | False | False | True | False | False | False | True | False |
| 3 | 3 | 30 | 70000 | 14.0 | 1 | 12000 | 11.11 | 0.17 | 0 | 5 | ... | False | False | False | False | False | True | False | False | False | True |
| 4 | 4 | 22 | 60000 | 2.0 | 0 | 6000 | 6.92 | 0.10 | 0 | 3 | ... | False | False | False | True | False | False | False | False | False | True |
5 rows × 21 columns
#preprocessing testing data set
testLoan_df = preprocessing_trainingDataSet(testLoan_df)
display(testLoan_df.head())
Shape of the dataset after dropping object-type columns: (39098, 10) Shape of the dataset after applying the get_dummies preprocessing technique: (39098, 10) Shape of the dataset after adding the preprocessed columns: (39098, 20)
| | id | person_age | person_income | person_emp_length | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length | DEBTCONSOLIDATION | EDUCATION | HOMEIMPROVEMENT | MEDICAL | PERSONAL | VENTURE | MORTGAGE | OTHER | OWN | RENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58645 | 23 | 69000 | 3.0 | 5 | 25000 | 15.76 | 0.36 | 0 | 2 | False | False | True | False | False | False | False | False | False | True |
| 1 | 58646 | 26 | 96000 | 6.0 | 2 | 10000 | 12.68 | 0.10 | 1 | 4 | False | False | False | False | True | False | True | False | False | False |
| 2 | 58647 | 26 | 30000 | 5.0 | 4 | 4000 | 17.19 | 0.13 | 1 | 2 | False | False | False | False | False | True | False | False | False | True |
| 3 | 58648 | 33 | 50000 | 4.0 | 0 | 7000 | 8.90 | 0.14 | 0 | 7 | True | False | False | False | False | False | False | False | False | True |
| 4 | 58649 | 26 | 102000 | 8.0 | 3 | 15000 | 16.32 | 0.15 | 1 | 4 | False | False | True | False | False | False | True | False | False | False |
Data Splitting and Scaling Techniques¶
Feature Scaling¶
In machine learning, features are mapped into n-dimensional space. If one variable (e.g., y) has much larger values than another (e.g., x), the Euclidean distance will be dominated by the larger variable, leading to potential loss of important information. Feature scaling solves this problem by normalizing or standardizing the data.
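To make the point concrete, here is a toy distance calculation with two features on very different scales (hypothetical numbers, roughly in the range of person_income and loan_int_rate):

```python
import numpy as np

# Two applicants described by [person_income, loan_int_rate] (made-up values)
a = np.array([60_000.0, 6.9])
b = np.array([61_000.0, 17.2])
# The 1,000 income gap dominates; the 10.3-point rate gap barely registers
dist = np.linalg.norm(a - b)
print(round(dist, 2))  # 1000.05 -> almost identical to the income gap alone
```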
Reasons for Feature Scaling:¶
- To better approximate a theoretical distribution with desirable statistical properties.
- To spread out data more evenly.
- To make data distribution more symmetric.
- To linearize relationships between variables.
- To ensure constant variance (homoscedasticity).
RobustScaler¶
RobustScaler is a scaling technique that uses the interquartile range (IQR) and the median to scale features. It's useful for datasets with outliers, as it makes scaling more robust and reliable.
Formula:
x_scaled = (x - median(x)) / IQR(x)
Where:
- median(x): The median of the feature.
- IQR(x): The interquartile range (75th percentile - 25th percentile).
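The formula can be verified on a small array with one outlier; this is a standalone sketch, independent of the notebook's pipeline:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with four ordinary values and one outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
scaler = RobustScaler()
scaled = scaler.fit_transform(x)
# center_ holds the median, scale_ holds the IQR
print(scaler.center_, scaler.scale_)  # [3.] [2.]
# Values: -1.0, -0.5, 0.0, 0.5, 498.5 -> the bulk lands near zero, only the outlier stays extreme
print(scaled.ravel())
```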
X = trainLoan_df.drop(['id','loan_status'],axis=1)
y = trainLoan_df['loan_status']
# Split the data into training and validation sets (80% train, 20% validation)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Apply scaling method
scaler = RobustScaler()
# fit the training data and transform both testing and training data
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
Correlation¶
Correlation measures the linear relationship between two variables and their dependencies. The correlation coefficient quantifies this relationship, ranging from -1 to 1. There are three types of correlation:
- Positive Correlation: the variables move in the same direction (directly proportional); the coefficient approaches +1.
- Negative Correlation: the variables move in opposite directions (inversely proportional); the coefficient approaches -1.
- No Correlation: no linear relationship between the variables; the coefficient is close to 0.
By using a correlation matrix (corr()), we can identify which features have a strong relationship with the target variable. Features with no significant relationship can be dropped. However, since I had only 12 features, I chose to keep all of them for training.
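Beyond the heatmap, one handy pattern is ranking features by their absolute correlation with the target. A sketch on a toy frame (hypothetical values, not the real training data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the encoded training data
df = pd.DataFrame({
    'loan_int_rate':       [5.0, 8.0, 11.0, 14.0],
    'loan_percent_income': [0.10, 0.10, 0.30, 0.30],
    'loan_status':         [0, 0, 1, 1],
})
# Absolute correlation of every feature with the target, strongest first
corr_with_target = (df.corr()['loan_status']
                      .drop('loan_status')
                      .abs()
                      .sort_values(ascending=False))
print(corr_with_target.index[0])  # loan_percent_income -> it tracks the target exactly in this toy data
```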
# Select all columns except the target and id columns
df_corr = trainLoan_df.loc[:, ~trainLoan_df.columns.isin(['loan_status','id'])]
plt.figure(figsize=(19, 8))
sns.heatmap(df_corr.corr(), fmt = '.1f', cmap="coolwarm", annot=True)
plt.title('Correlation Matrix')
plt.show()
Logistic Regressor¶
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is binary. Like all regression analyses, logistic regression is a predictive analysis: it is used to describe data and to explain the relationship between one dependent binary variable and one or more independent variables. In short, logistic regression is used when the dependent variable (target) is categorical.
For example:
- To predict whether an email is spam (1) or not (0).
- Whether online transaction is fraudulent (1) or not (0).
- Whether the loan is granted (1) or not (0).
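Under the hood, logistic regression passes a linear combination of the features through the sigmoid function, which squashes any real-valued score into a probability between 0 and 1. A hand-rolled sketch of just the sigmoid step (not sklearn's internal implementation):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 means maximal uncertainty -> probability 0.5
print(sigmoid(0.0))  # 0.5
# Large positive scores approach 1, large negative scores approach 0
print(round(sigmoid(4.0), 3), round(sigmoid(-4.0), 3))  # 0.982 0.018
```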
# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)
# Train the model
print("Training Logistic Regression...")
model.fit(x_train, y_train)
# Make predictions
y_pred = model.predict(x_test)
y_prob = model.predict_proba(x_test)[:, 1]
Training Logistic Regression...
Prediction Summary¶
Confusion Matrix¶
A confusion matrix is used only for classification tasks. It summarizes the performance of a classification model by comparing actual and predicted labels. The matrix consists of the following four metrics:
| | Predicted True | Predicted False |
|---|---|---|
| Actual True | True Positive (TP) | False Negative (FN) |
| Actual False | False Positive (FP) | True Negative (TN) |
Accuracy¶
The Accuracy of a classification model is calculated as the ratio of correctly predicted observations (both true positives and true negatives) to the total number of observations:
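In terms of the confusion-matrix cells above, accuracy = (TP + TN) / (TP + TN + FP + FN). A quick check with made-up counts (not this notebook's results):

```python
# Hypothetical confusion-matrix counts (not this notebook's results)
TP, TN, FP, FN = 740, 160, 60, 40
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # 0.9 -> 900 correct predictions out of 1000
```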
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
report = classification_report(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
# Display evaluation metrics
print(f"\nAccuracy: {accuracy:.2f}")
print(f"AUC: {auc:.2f}")
print("\nClassification Report:\n", report)
# Plot confusion matrix
TFmatrix = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
TFmatrix.plot(cmap='RdPu')
plt.title('Confusion Matrix - Logistic Regression')
plt.show()
Accuracy: 0.90
AUC: 0.89
Classification Report:
precision recall f1-score support
0 0.92 0.98 0.94 10080
1 0.75 0.45 0.56 1649
accuracy 0.90 11729
macro avg 0.83 0.71 0.75 11729
weighted avg 0.89 0.90 0.89 11729
# Note: the model was trained on all features; this plot projects onto the first two scaled features, so the boundary shown is only a rough 2D illustration
plt.figure(figsize=(8, 6))
# Plot the test data points, colored by their actual labels
plt.scatter(x_test[:, 0], x_test[:, 1], c=y_test, cmap='viridis', edgecolor='k', alpha=0.6, label="Actual")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Actual vs Predicted Outcomes")
# Decision boundary
x_vals = np.linspace(x_test[:, 0].min(), x_test[:, 0].max(), 100)
y_vals = -(model.coef_[0][0] * x_vals + model.intercept_[0]) / model.coef_[0][1]
plt.plot(x_vals, y_vals, color="red", linewidth=2, label="Decision Boundary")
plt.legend()
plt.show()
# Predict the target values for the test dataset (drop the 'id' column first)
x_testPredict = testLoan_df.drop(['id'], axis=1)
x_testPredict = scaler.transform(x_testPredict)
test_predictions = model.predict(x_testPredict)
# Add the predicted values to your test dataset
testLoan_df['Predicted_Loan_Status'] = test_predictions
print(testLoan_df['Predicted_Loan_Status'].unique())
[1 0]
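For a Kaggle playground competition, the predictions are normally written to a submission CSV alongside the id column. A sketch assuming the expected columns are id and loan_status (check the competition's sample_submission.csv for the exact names):

```python
import pandas as pd

# Hypothetical ids and predictions standing in for testLoan_df['id'] and test_predictions
ids = [58645, 58646, 58647]
preds = [1, 0, 0]
submission = pd.DataFrame({'id': ids, 'loan_status': preds})
submission.to_csv('submission.csv', index=False)
print(submission.shape)  # (3, 2)
```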
References¶
- Kaggle's DataSet Loan Approval Prediction (Playground Series - Season 4, Episode 10)
- I used this notebook tutorial to get initial ideas for the layout of this notebook
- I used this Kaggle notebook to build my understanding of encoding techniques
- I used this notebook as an example reference
- To get a brief understanding of notebooks, I read Kaggle's documentation and applied what I learned
- Some of my content is taken from my previous work
!jupyter nbconvert --to html AI_CW.ipynb
[NbConvertApp] Converting notebook AI_CW.ipynb to html [NbConvertApp] WARNING | Alternative text is missing on 4 image(s). [NbConvertApp] Writing 6270816 bytes to AI_CW.html